Github: https://github.com/WalkerMao/Data-Wrangling-final-project

This data set contains sale prices and 79 explanatory variables describing almost every aspect of residential homes, including housing structure and decoration condition, the surrounding facilities, and several indices describing the real-estate market. I used these variables to predict the sale prices.

Each house has a unique Id and is described by 79 predictor variables such as MSSubClass (the building class), Street (type of road access), and YearBuilt (original construction date). The response variable is SalePrice, the property's sale price in dollars, which is the target I tried to predict.

Data exploration and transformation

Overview of the data

Read the data

Delete the column Id, since we do not need this feature. Then divide the training data into predictors (xtr) and the response (ytr).
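The analysis itself is in R (its code is not shown here); as an illustration only, the step above can be sketched in Python, with a three-row toy frame standing in for the real training file:

```python
import pandas as pd

# Toy stand-in for the training file (the real set has 1460 rows, 79 predictors).
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "OverallQual": [7, 6, 8],
    "GrLivArea": [1710, 1262, 1786],
    "SalePrice": [208500, 181500, 223500],
})

# Drop Id (a pure identifier), then split predictors from the response.
xtr = train.drop(columns=["Id", "SalePrice"])
ytr = train["SalePrice"]
```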

Dimension

## [1] "There are 1460 training samples and 79 explanatory features."

Types of explanatory variables

## 
## character   numeric 
##        43        36

Missing data processing

There are many NA values in this data set. The plot below shows the proportion of NA values for each variable that contains them.

However, most of them are meaningful rather than missing. For example, an NA in BsmtQual, BsmtCond, GarageQual, or GarageCond means that the house does not have a basement or a garage.

After reading the data description, I fill in these meaningful NA values with ‘None’. For the genuinely missing data, I fill in the NA of numerical variables with the median and the NA of categorical variables with the mode.
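The rule above (meaningful NA → "None"; numeric NA → median; categorical NA → mode) could look like this in Python, a minimal sketch on a toy frame rather than the report's actual R code:

```python
import pandas as pd

df = pd.DataFrame({
    "BsmtQual": ["Gd", None, "TA"],     # NA means "no basement": meaningful
    "LotFrontage": [65.0, None, 80.0],  # NA is genuinely missing
    "MSZoning": ["RL", "RM", None],     # NA is genuinely missing
})

# Meaningful NA -> the explicit level "None"
meaningful = ["BsmtQual"]  # in the real data also BsmtCond, GarageQual, GarageCond, ...
df[meaningful] = df[meaningful].fillna("None")

# Genuinely missing: numeric -> median, categorical -> mode
for col in df.columns[df.isna().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
```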

Label encoding

Some categorical variables have a natural numerical order, which means they should be treated as ordinal variables. I encode labels for these variables; for example, ‘Ex’ (excellent) = 5, ‘Gd’ (good) = 4, ‘TA’ (typical) = 3, ‘Fa’ (fair) = 2, ‘Po’ (poor) = 1, ‘None’ = 0.
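A Python sketch of this encoding (the quality scale is from the text; the two column names are just examples, and the real data has more ordinal columns):

```python
import pandas as pd

# Quality levels share one ordinal scale across several variables.
quality_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "None": 0}

df = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex"],
                   "BsmtQual": ["None", "TA", "Gd"]})

for col in ["KitchenQual", "BsmtQual"]:
    df[col] = df[col].map(quality_map)
```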

Log transformation of target variable (SalePrice)

The target variable is SalePrice, which is numerical. The two plots on the left below are the density plot and QQ plot of the original SalePrice. We can observe that it is right-skewed, roughly resembling a skew-normal distribution.

Let us try a log transformation to reduce the skewness. The two plots on the right below are the density plot and QQ plot of log(1 + SalePrice). Both the skewness and the QQ plot look much better after the transformation.
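The effect can be demonstrated numerically (a Python illustration with a handful of made-up prices, not the report's data):

```python
import numpy as np
from scipy.stats import skew

# A few illustrative sale prices with one large outlier (long right tail).
prices = np.array([208500.0, 181500.0, 223500.0, 140000.0, 755000.0])

log_prices = np.log1p(prices)  # log(1 + SalePrice)

# The transformation compresses the right tail, reducing skewness.
before, after = skew(prices), skew(log_prices)
```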

Box-Cox transformation

There are many variables that are highly skewed. I used the Box-Cox transformation to reduce the skewness.

\[ x'_{\lambda} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0, \\ \log(x), & \text{if } \lambda = 0. \end{cases} \]

I use for-loops to search for the parameter \(\lambda\), and save the data to CSV files before and after the Box-Cox transformation.
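A minimal Python sketch of such a grid search, choosing the \(\lambda\) that minimizes absolute skewness (the search criterion and grid here are assumptions; the report's own R loop is not shown):

```python
import numpy as np
from scipy.stats import skew

def boxcox_transform(x, lam):
    # Box-Cox: (x^lam - 1) / lam for lam != 0, log(x) for lam == 0
    return np.log(x) if abs(lam) < 1e-12 else (x ** lam - 1) / lam

def best_lambda(x, grid=np.arange(-2.0, 2.01, 0.1)):
    # For-loop-style grid search: keep the lambda with the smallest
    # absolute skewness (x must be strictly positive).
    return min(grid, key=lambda lam: abs(skew(boxcox_transform(x, lam))))

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 100.0])
lam = best_lambda(x)
transformed = boxcox_transform(x, lam)
```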

The plots below show the density distributions of the numerical variables: the first is the data before the Box-Cox transformation, and the second is the data after it.

Darker panels indicate higher skewness. The second plot (after Box-Cox) is visibly lighter than the first (before Box-Cox), which means the Box-Cox transformation significantly reduces the skewness of these numerical variables.

## [1] "Density distribution before Box-Cox transformation: "

## [1] "After Box-Cox transformation: "

Get the dummy variables

Some categorical variables have no numerical order, so we cannot transform them into ordinal variables. I converted them to dummy variables instead.
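In Python terms this conversion is one-hot encoding; a small sketch (toy data, real column names):

```python
import pandas as pd

df = pd.DataFrame({"Foundation": ["PConc", "CBlock", "PConc"],
                   "GrLivArea": [1710, 1262, 1786]})

# One-hot encode the nominal variable; numeric columns pass through untouched.
df = pd.get_dummies(df, columns=["Foundation"])
```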

After these data transformations, all of the data are in numerical format.

Standardize data to standard Score (Z score)

Different variables have different scales, which may have impacts on our models.

In regularized linear regression, we penalize the size of the coefficients, which is affected by the different scales of the variables. So it is necessary to standardize the variables to eliminate the influence of scale.

I centered the data and rescaled it to units of standard deviations by subtracting the mean and dividing by the standard deviation.

\[ x_i^* = \frac {x_i - mean(x)} {sd(x)} \]
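The formula in Python (using the sample standard deviation, as R's `sd()` does):

```python
import numpy as np

x = np.array([1710.0, 1262.0, 1786.0, 1717.0])

# z-score: subtract the mean, divide by the (sample) standard deviation.
z = (x - x.mean()) / x.std(ddof=1)
```

After this rescaling every variable has mean 0 and standard deviation 1, so the penalty treats all coefficients on an equal footing.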

Save the transformed data

All of the transformations are done. Now I save the data to a CSV file.

## [1] "After data transformation, there are 227 explanatory variables."

Correlation matrix

Correlation is a useful measure of variable importance. The variables in the plot below are ordered by the absolute value of their correlation with the response variable log1p_SalePrice. Since there are too many variables to show, I keep only the variables whose absolute correlation with log1p_SalePrice is greater than 0.5.

We can see that the variable OverallQual (overall quality) has the highest correlation with log1p_SalePrice, meaning it is strongly positively correlated with our target variable.

## [1] "====== High influential variables: ======="
##  [1] "log1p_SalePrice"  "OverallQual"      "GrLivArea"       
##  [4] "GarageCars"       "ExterQual"        "BsmtQual"        
##  [7] "KitchenQual"      "GarageArea"       "TotalBsmtSF"     
## [10] "YearBuilt"        "1stFlrSF"         "GarageFinish"    
## [13] "FullBath"         "YearRemodAdd"     "TotRmsAbvGrd"    
## [16] "Foundation_PConc" "Fireplaces"       "FireplaceQu_0"
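The selection rule itself is simple; a Python sketch on synthetic data (one strongly related variable, one unrelated one, neither taken from the report):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
qual = rng.normal(size=n)
df = pd.DataFrame({
    "log1p_SalePrice": qual + 0.3 * rng.normal(size=n),
    "OverallQual": qual,           # strongly related to the target
    "MoSold": rng.normal(size=n),  # essentially unrelated
})

# Rank variables by |correlation| with the target; keep those above 0.5.
corr = df.corr()["log1p_SalePrice"].abs().sort_values(ascending=False)
high_influence = corr[corr > 0.5].index.tolist()
```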

Regression plots for important variables

For the plots below, I use the top 12 influential variables ranked by their correlation with log1p_SalePrice.

The formula at the top of each plot is the fitted simple linear regression model with its corresponding R-squared. The sky-blue points are the observations, the blue lines are the simple linear regression fits, and the tan lines are LOESS smooths.
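The quantities annotated on each panel, the slope, intercept, and R-squared of a simple linear regression, can be computed as in this Python sketch (synthetic predictor and response, not the report's variables):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)                       # a standardized predictor
y = 0.8 * x + rng.normal(scale=0.3, size=100)  # a noisy linear response

# Fit y = slope * x + intercept by least squares, then compute R^2.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```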

Modeling

Since there are many explanatory variables, we should use a regularized linear regression model. We use cross-validation to select the best penalty parameter \(\lambda\).
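The report fits these models in R (likely via glmnet-style tooling, though the code is not shown). As an illustrative Python equivalent, cross-validated penalty selection on synthetic data looks like this (scikit-learn calls the penalty `alpha` rather than \(\lambda\)):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 30))
beta = np.zeros(30)
beta[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]          # sparse "true" signal
y = X @ beta + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation over a grid of penalties; the selected
# penalty minimizes the cross-validated error.
model = LassoCV(cv=5, random_state=0).fit(X, y)
best_lambda = model.alpha_
```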

LASSO

## [1] "Best lambda for LASSO: 0.005432."
## [1] "====== Variables with top 20 largest absolute value of coeficients. ======"
##                                1
## (Intercept)          12.01693480
## GrLivArea             0.12015387
## OverallQual           0.07836435
## YearBuilt             0.04749840
## OverallCond           0.03663626
## LotArea               0.03568582
## MSZoning_C (all)     -0.02832925
## BsmtFinSF1            0.02809470
## RoofMatl_ClyTile     -0.02698366
## GarageCars            0.02354650
## TotalBsmtSF           0.02156035
## 1stFlrSF              0.01925851
## Neighborhood_NridgHt  0.01811729
## Condition2_PosN      -0.01793905
## Neighborhood_Crawfor  0.01734320
## BsmtQual              0.01712859
## KitchenQual           0.01665728
## GarageArea            0.01655244
## Fireplaces            0.01599399
## Functional            0.01495100
## [1] "Training RMSE of LASSO: 0.108667."

The test RMSE of LASSO: 0.12575.

Ridge regression

## [1] "Best lambda for ridge regression: 0.090618."
## [1] "====== Variables with top 20 largest absolute value of coeficients. ======"
##                                1
## (Intercept)          12.01568587
## GrLivArea             0.04818515
## OverallQual           0.04658009
## 1stFlrSF              0.03726617
## TotalBsmtSF           0.02961360
## OverallCond           0.02934778
## MSZoning_C (all)     -0.02925493
## LotArea               0.02525563
## RoofMatl_ClyTile     -0.02436350
## GarageCars            0.02204304
## TotRmsAbvGrd          0.01972408
## Neighborhood_NridgHt  0.01962357
## BsmtQual              0.01952029
## GarageArea            0.01947854
## FullBath              0.01946479
## Neighborhood_Crawfor  0.01785308
## YearBuilt             0.01712255
## KitchenQual           0.01676403
## Condition2_PosN      -0.01661608
## Neighborhood_StoneBr  0.01646154
## [1] "Training RMSE of ridge regression: 0.102999."

The test RMSE of ridge regression: 0.13306.

Summary

From the test RMSE, we can conclude that LASSO performs better than ridge regression on this problem.

From the coefficients of the two regularized linear regression models, the variables GrLivArea, OverallQual, 1stFlrSF, TotalBsmtSF, OverallCond, and LotArea have a significantly positive influence on the response variable SalePrice, while MSZoning_C (all) and RoofMatl_ClyTile have a significantly negative influence.